1. The main content of the project
This study focused on 10 water quality monitoring sites in the North
Island of New Zealand, systematically analyzing water quality
time-series data from 2019 to 2023. It employed a comprehensive approach
combining data preprocessing, data validation, time series modeling
(ARIMA), non-metric multidimensional scaling analysis (nMDS), and
K-means clustering to fully reveal the spatiotemporal variation
characteristics and regional heterogeneity of water quality.
Through ARIMA models, nMDS, and K-means clustering, the study examined
the evolution patterns and spatial distribution differences of regional
water quality from both temporal and spatial dimensions. Data modeling
and clustering analysis make it possible to identify the key indicators
influencing water quality, track temporal changes, and explore spatial
heterogeneity among different water bodies. This in turn supports
targeted identification of potential pollution sources in each water
body and the implementation of corresponding protective measures.
Additionally, this study established a comprehensive spatio-temporal
water quality analysis workflow integrating time series models,
clustering analysis, and multidimensional scaling visualization methods,
demonstrating strong generalizability and applicability to water quality
dynamic analysis in other basins or countries.
The figure below shows the locations of the water sampling sites. The
dataset is sourced from NIWA
and includes water quality data from 10 sampling sites in the North
Island of New Zealand between 2019 and 2023.
This study selected eight key water quality indicators:
- Water Temperature (Water.Temp, °C)
- Turbidity (Turbidity, FNU)
- Dissolved Oxygen Saturation (Dis.Oxygen.Sat, %)
- pH (pH.25DegC)
- Nitrate and Nitrite as Nitrogen (NO3_NO2_N, ppb)
- Total Phosphorus (P_Tot, ppb)
- Total Coliforms (Tot.Coliforms, MPN/100ml)
- Escherichia coli (E.Coli, MPN/100ml)
Figure: Water quality sampling sites
2. Preprocessing data
2.1 Loading the essential libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggrepel)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(tseries)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
2.2 Load and Prepare Water Quality Data
This code reads, cleans, and merges multiple raw CSV files. It
standardizes column name formats and adds location identifiers,
ultimately saving the organized data to a new file.
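The location identifier is derived from each file name; in isolation the
two base-R helpers behave as below (the path "../raw_data/AK1.csv" is an
illustrative example matching the raw_data layout used here):

```r
# Derive a location code from a CSV path:
# basename() strips the directory, file_path_sans_ext() strips ".csv"
path <- "../raw_data/AK1.csv"
location_name <- tools::file_path_sans_ext(basename(path))
location_name  # "AK1"
```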
# Set folder path and get CSV file list
folder_path <- "../raw_data"
file_list <- list.files(path = folder_path, pattern = "\\.csv$", full.names = TRUE)
data_list <- list()
first_file <- TRUE
for(file in file_list) {
  if(first_file){
    # Read first file and get column names and units
    df <- read.csv(file, skip = 3, stringsAsFactors = FALSE)
    col_names <- colnames(df)
    unit <- df[1, 2:ncol(df)]
    first_file <- FALSE
  } else {
    # Read other files with same column names
    df <- read.csv(file, skip = 4, stringsAsFactors = FALSE)
    colnames(df) <- col_names
  }
  # Convert data columns to character
  df[, 2:ncol(df)] <- lapply(df[, 2:ncol(df)], as.character)
  # Add location name derived from the file name
  location_name <- tools::file_path_sans_ext(basename(file))
  df$Location <- location_name
  data_list[[length(data_list) + 1]] <- df
}
# Combine all data
combined_data <- bind_rows(data_list)
head(combined_data)
## X Water.Temp.NRWQN Turbidity..Form.Neph..NRWQN
## 1 Timestamp (UTC+12:00) Value (°C) Value (FNU)
## 2 2019-01-08 09:20:00 22.600000000000023 7.1
## 3 2019-02-04 09:05:00 22.69999999999999 2.8
## 4 2019-03-05 09:15:00 19.69999999999999 3.5
## 5 2019-04-11 09:40:00 14.899999999999977 5.4
## 6 2019-05-07 09:20:00 12.899999999999977 3.2
## Dis.Oxygen.Sat.NRWQN pH.25DegC.NRWQN NO3.NO2.as.N.NRWQN P..Tot..NRWQN
## 1 Value (%) Value (pH) Value (ppb) Value (ppb)
## 2 88.8 7.3 129 47
## 3 90.8 7.8 1 29
## 4 88.4 7.7 1 33
## 5 87.4 9.4 170 41
## 6 91.8 7.6 10 31
## Tot.Coliforms.NRWQN E.Coli.NRWQN Location
## 1 Value (MPN/100ml) Value (MPN/100ml) AK1
## 2 1986 119 AK1
## 3 2359 63 AK1
## 4 1723 175 AK1
## 5 2419 60 AK1
## 6 613 114 AK1
# Export to CSV
write.csv(combined_data, file = "../derive_data/combined.csv", row.names = FALSE)
2.3 Clean and Organize data
This code further cleans and organizes the merged data: it resets the
column names, merges the original variable names with the units, and
converts the data values to numeric types, ultimately producing a
standardized data table with a clear structure that is easy to
analyze.
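The renaming step below rests on two sub() calls; taken in isolation,
with a column name from the data shown above, they behave like this:

```r
# Drop the trailing agency suffix: remove everything after the last "."
sub("\\.[^.]*$", "", "Water.Temp.NRWQN")       # "Water.Temp"
# Splice the cleaned name into the unit string
sub("Value", "Water.Temp", "Value (\u00b0C)")  # "Water.Temp (°C)"
```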
# Read combined CSV without header
combined_data <- read.csv("../derive_data/combined.csv", header = FALSE, stringsAsFactors = FALSE)
# Extract variable names (row 1); row 2 holds the units
name <- combined_data[1, 2:(ncol(combined_data) - 1)]
# Label the Date and Location columns in the header row
combined_data[2, ncol(combined_data)] <- "Location"
combined_data[2, 1] <- "Date"
# Combine each variable name with its unit
# (note: `unit` was created in the loading chunk above)
new_col <- mapply(function(name, unit){
  name_new <- sub("\\.[^.]*$", "", name)    # drop the ".NRWQN" suffix
  unit_new <- sub("Value", name_new, unit)  # e.g. "Value (°C)" -> "Water.Temp (°C)"
  return(unit_new)
}, name, unit)
# Apply new column names
combined_data[2, 2:(ncol(combined_data) - 1)] <- new_col
new_col <- combined_data[2, ]
combined_data <- combined_data[-c(1, 2), ]
colnames(combined_data) <- new_col
# Convert numeric columns (Date and Location stay character)
combined_data[, 2:(ncol(combined_data) - 1)] <-
  lapply(combined_data[, 2:(ncol(combined_data) - 1)], as.numeric)
head(combined_data)
## Date Water.Temp (°C) Turbidity..Form.Neph. (FNU)
## 3 2019-01-08 09:20:00 22.600000000000023 7.1
## 4 2019-02-04 09:05:00 22.69999999999999 2.8
## 5 2019-03-05 09:15:00 19.69999999999999 3.5
## 6 2019-04-11 09:40:00 14.899999999999977 5.4
## 7 2019-05-07 09:20:00 12.899999999999977 3.2
## 8 2019-06-05 09:50:00 11.600000000000023 3.6
## Dis.Oxygen.Sat (%) pH.25DegC (pH) NO3.NO2.as.N (ppb) P..Tot. (ppb)
## 3 88.8 7.3 129 47
## 4 90.8 7.8 1 29
## 5 88.4 7.7 1 33
## 6 87.4 9.4 170 41
## 7 91.8 7.6 10 31
## 8 94.3 7.7 244 45
## Tot.Coliforms (MPN/100ml) E.Coli (MPN/100ml) Location
## 3 1986 119 AK1
## 4 2359 63 AK1
## 5 1723 175 AK1
## 6 2419 60 AK1
## 7 613 114 AK1
## 8 1414 55 AK1
2.4 Describe data
This code explores the cleaned data: it inspects the structure and
column names, reshapes the table from wide to long format, parses the
timestamp column, and plots the time series of every indicator by
location.
From the line chart, several significant abnormal water quality
indicators can be observed, such as:
- abnormally high concentrations of E. coli and
total coliform bacteria at the WA1 site.
- abnormally high turbidity and total
phosphorus levels at the WA5 site.
- abnormally low dissolved oxygen levels at the AK1
site.
Additionally, certain indicators such as NO₃+NO₂ and
water temperature exhibit pronounced seasonal
fluctuations.
Based on these findings, further analysis can be conducted in two
directions:
on one hand, to classify water quality and identify key abnormal
indicators;
on the other hand, to test the stationarity and non-white noise
characteristics of the time series, providing a foundation for
subsequent predictive modeling.
# Check structure, column names and first rows
str(combined_data)
## 'data.frame': 553 obs. of 10 variables:
## $ Date : chr "2019-01-08 09:20:00" "2019-02-04 09:05:00" "2019-03-05 09:15:00" "2019-04-11 09:40:00" ...
## $ Water.Temp (°C) : chr "22.600000000000023" "22.69999999999999" "19.69999999999999" "14.899999999999977" ...
## $ Turbidity..Form.Neph. (FNU): chr "7.1" "2.8" "3.5" "5.4" ...
## $ Dis.Oxygen.Sat (%) : chr "88.8" "90.8" "88.4" "87.4" ...
## $ pH.25DegC (pH) : chr "7.3" "7.8" "7.7" "9.4" ...
## $ NO3.NO2.as.N (ppb) : chr "129" "1" "1" "170" ...
## $ P..Tot. (ppb) : chr "47" "29" "33" "41" ...
## $ Tot.Coliforms (MPN/100ml) : chr "1986" "2359" "1723" "2419" ...
## $ E.Coli (MPN/100ml) : chr "119" "63" "175" "60" ...
## $ Location : chr "AK1" "AK1" "AK1" "AK1" ...
colnames(combined_data)
## [1] "Date" "Water.Temp (°C)"
## [3] "Turbidity..Form.Neph. (FNU)" "Dis.Oxygen.Sat (%)"
## [5] "pH.25DegC (pH)" "NO3.NO2.as.N (ppb)"
## [7] "P..Tot. (ppb)" "Tot.Coliforms (MPN/100ml)"
## [9] "E.Coli (MPN/100ml)" "Location"
head(combined_data)
## Date Water.Temp (°C) Turbidity..Form.Neph. (FNU)
## 3 2019-01-08 09:20:00 22.600000000000023 7.1
## 4 2019-02-04 09:05:00 22.69999999999999 2.8
## 5 2019-03-05 09:15:00 19.69999999999999 3.5
## 6 2019-04-11 09:40:00 14.899999999999977 5.4
## 7 2019-05-07 09:20:00 12.899999999999977 3.2
## 8 2019-06-05 09:50:00 11.600000000000023 3.6
## Dis.Oxygen.Sat (%) pH.25DegC (pH) NO3.NO2.as.N (ppb) P..Tot. (ppb)
## 3 88.8 7.3 129 47
## 4 90.8 7.8 1 29
## 5 88.4 7.7 1 33
## 6 87.4 9.4 170 41
## 7 91.8 7.6 10 31
## 8 94.3 7.7 244 45
## Tot.Coliforms (MPN/100ml) E.Coli (MPN/100ml) Location
## 3 1986 119 AK1
## 4 2359 63 AK1
## 5 1723 175 AK1
## 6 2419 60 AK1
## 7 613 114 AK1
## 8 1414 55 AK1
# Reshape data from wide to long format
data_long <- combined_data %>%
  pivot_longer(cols = -c(Date, Location), names_to = "Variable", values_to = "Value")
# Clean and parse Date column
dates <- gsub("\\s+", " ", data_long$Date)  # collapse repeated whitespace
dates <- trimws(dates)                      # remove leading/trailing spaces
data_long$Date <- as.POSIXct(dates, format = "%Y-%m-%d %H:%M:%S", tz = "Pacific/Auckland")
# Save reshaped data
write.csv(data_long, "../derive_data/data_long.csv", row.names = FALSE)
data_long$Value <- as.numeric(data_long$Value)
print(data_long)
## # A tibble: 4,424 × 4
## Date Location Variable Value
## <dttm> <chr> <chr> <dbl>
## 1 2019-01-08 09:20:00 AK1 Water.Temp (°C) 22.6
## 2 2019-01-08 09:20:00 AK1 Turbidity..Form.Neph. (FNU) 7.1
## 3 2019-01-08 09:20:00 AK1 Dis.Oxygen.Sat (%) 88.8
## 4 2019-01-08 09:20:00 AK1 pH.25DegC (pH) 7.3
## 5 2019-01-08 09:20:00 AK1 NO3.NO2.as.N (ppb) 129
## 6 2019-01-08 09:20:00 AK1 P..Tot. (ppb) 47
## 7 2019-01-08 09:20:00 AK1 Tot.Coliforms (MPN/100ml) 1986
## 8 2019-01-08 09:20:00 AK1 E.Coli (MPN/100ml) 119
## 9 2019-02-04 09:05:00 AK1 Water.Temp (°C) 22.7
## 10 2019-02-04 09:05:00 AK1 Turbidity..Form.Neph. (FNU) 2.8
## # ℹ 4,414 more rows
# Plot time series trends of variables by location
ggplot(data_long, aes(x = Date, y = Value, color = Location)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Trends of various indicators in different regions", x = "Date", y = "Value") +
  theme_bw() +
  theme(legend.position = "bottom")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

3. Classify the water quality across time
3.1 Score each water quality record
The main function of this code is to score each record based on
specific water quality variable values and analyze the scoring
results.
- It performs scoring function transformations on certain variables,
calculates average scores by month and location, and
visually displays the trends in water quality scores across different
regions.
- It highlights the temporal changes in variables with low scores (1
or 2), making it easier to identify pollution risk points and abnormal
indicators.
The following are our scoring criteria for each indicator (a score of
4 is best, 1 is worst, matching the scoring function below):

| Indicator        | Unit      | Score 4 | Score 3        | Score 2        | Score 1      |
|------------------|-----------|---------|----------------|----------------|--------------|
| Turbidity        | FNU       | <30     | 30–60          | 60–100         | >100         |
| Dissolved Oxygen | %         | >80     | 60–80          | 40–60          | <40          |
| pH               | –         | 6.5–8.5 | 6–6.5 or 8.5–9 | 5.5–6 or 9–9.5 | <5.5 or >9.5 |
| NO3 + NO2        | ppb       | <1000   | 1000–3000      | 3000–6000      | >6000        |
| Total Phosphorus | ppb       | <50     | 50–150         | 150–300        | >300         |
| Total Coliforms  | MPN/100mL | <1000   | 1000–3000      | 3000–5000      | >5000        |
| E. Coli          | MPN/100mL | <260    | 260–1000       | 1000–2400      | >2400        |
We ignored the impact of water temperature because,
as shown in the line graph above, water temperature remained at normal
levels throughout the seasons and did not accurately represent water
quality. Therefore, we only scored the other seven indicators.
# Define scoring function based on variable and its value
score_function <- function(var, val) {
  if (var == "Turbidity..Form.Neph. (FNU)") {
    if (val < 30) return(4)
    else if (val < 60) return(3)
    else if (val < 100) return(2)
    else return(1)
  } else if (var == "Dis.Oxygen.Sat (%)") {
    if (val > 80) return(4)
    else if (val > 60) return(3)
    else if (val > 40) return(2)
    else return(1)
  } else if (var == "pH.25DegC (pH)") {
    if (val >= 6.5 & val <= 8.5) return(4)
    else if ((val >= 6 & val < 6.5) | (val > 8.5 & val <= 9)) return(3)
    else if ((val >= 5.5 & val < 6) | (val > 9 & val <= 9.5)) return(2)
    else return(1)
  } else if (var == "NO3.NO2.as.N (ppb)") {
    if (val < 1000) return(4)
    else if (val < 3000) return(3)
    else if (val < 6000) return(2)
    else return(1)
  } else if (var == "P..Tot. (ppb)") {
    if (val < 50) return(4)
    else if (val < 150) return(3)
    else if (val < 300) return(2)
    else return(1)
  } else if (var == "Tot.Coliforms (MPN/100ml)") {
    if (val < 1000) return(4)
    else if (val < 3000) return(3)
    else if (val < 5000) return(2)
    else return(1)
  } else if (var == "E.Coli (MPN/100ml)") {
    if (val < 260) return(4)
    else if (val < 1000) return(3)
    else if (val < 2400) return(2)
    else return(1)
  } else {
    return(NA)
  }
}
# Apply scoring to filtered data (exclude missing values and temperature)
score_data <- data_long %>%
  filter(!is.na(Value) & Variable != "Water.Temp (°C)") %>%
  rowwise() %>%
  mutate(Score = score_function(Variable, Value)) %>%
  ungroup()
print(score_data)
## # A tibble: 3,842 × 5
## Date Location Variable Value Score
## <dttm> <chr> <chr> <dbl> <dbl>
## 1 2019-01-08 09:20:00 AK1 Turbidity..Form.Neph. (FNU) 7.1 4
## 2 2019-01-08 09:20:00 AK1 Dis.Oxygen.Sat (%) 88.8 4
## 3 2019-01-08 09:20:00 AK1 pH.25DegC (pH) 7.3 4
## 4 2019-01-08 09:20:00 AK1 NO3.NO2.as.N (ppb) 129 4
## 5 2019-01-08 09:20:00 AK1 P..Tot. (ppb) 47 4
## 6 2019-01-08 09:20:00 AK1 Tot.Coliforms (MPN/100ml) 1986 3
## 7 2019-01-08 09:20:00 AK1 E.Coli (MPN/100ml) 119 4
## 8 2019-02-04 09:05:00 AK1 Turbidity..Form.Neph. (FNU) 2.8 4
## 9 2019-02-04 09:05:00 AK1 Dis.Oxygen.Sat (%) 90.8 4
## 10 2019-02-04 09:05:00 AK1 pH.25DegC (pH) 7.8 4
## # ℹ 3,832 more rows
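As a design note, the per-row if/else scoring can also be expressed in
vectorized form with cut(). A minimal sketch for the turbidity
thresholds only (the function name score_turbidity is illustrative and
not part of the analysis code):

```r
score_turbidity <- function(vals) {
  # Bins [-Inf,30), [30,60), [60,100), [100,Inf) map to scores 4, 3, 2, 1,
  # mirroring the turbidity branch of score_function
  as.numeric(as.character(cut(vals, breaks = c(-Inf, 30, 60, 100, Inf),
                              labels = c(4, 3, 2, 1), right = FALSE)))
}
score_turbidity(c(7.1, 45, 123))  # 4 3 1
```

rowwise() keeps the original code readable; a vectorized form like this
scales better on large tables.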
# Calculate average monthly score by Location
monthly_scores <- score_data %>%
  mutate(YearMonth = floor_date(Date, "month")) %>%
  group_by(Location, YearMonth) %>%
  summarise(Average_Score = mean(Score, na.rm = TRUE), .groups = "drop")
3.2 Visualize water quality score results
This code is primarily used for visualizing and analyzing water
quality score results. First, monthly average water quality score trend
charts were plotted for each monitoring site to identify overall water
quality changes. Subsequently, low-score data with scores of 1 or 2 were
filtered, and trend charts for abnormal indicators were plotted
separately for coliform bacteria and other indicators
(such as pH, total phosphorus, and turbidity) to
further reveal key water quality issues in different regions. Finally,
low-score pH data were extracted for further
analysis.
- The results showed that RO6 and WN5 had the best water quality, with
no indicator scoring below B.
- RO6 is located near Lake Taupo, where water quality is well
managed.
- WN5 is located in the southernmost mountainous area of the North
Island, with no surrounding urban development and only a single highway
crossing it, so its water quality is well protected.
Other characteristics include: HV3 performed excellently in terms
of total coliform bacteria and Escherichia
coli; GS3 performed excellently in terms of total
phosphorus and turbidity.
In areas with poor water quality, AK1, WA5, and WA1 experienced
multiple instances of abnormal E. coli and total coliform
counts. Similarly, these three locations performed poorly in
terms of total phosphorus and turbidity. They are
located in Hoteo at Gubbs, Waitara at Bertrand Rd, and Rangitikei at
Mangaweka, respectively. Further analysis of land use types in these
three locations is needed to better explain the causes.
# Plot monthly average water quality score by location
ggplot(monthly_scores, aes(x = YearMonth, y = Average_Score)) +
  geom_line(linewidth = 1, color = "blue") +
  labs(title = "Monthly Average Water Quality Score by Location",
       x = "Month", y = "Average Score") +
  facet_wrap(~Location, scales = "free") +
  theme_bw() +
  theme(legend.position = "bottom")

# Filter low scores (1 or 2)
low_scores <- score_data %>%
  mutate(YearMonth = floor_date(Date, "month")) %>%
  filter(Score %in% c(1, 2))
# Plot low-score trends for E.Coli and Tot.Coliforms
ggplot(low_scores %>% filter(Variable %in% c("E.Coli (MPN/100ml)", "Tot.Coliforms (MPN/100ml)")),
       aes(x = Date, y = Value, color = Variable)) +
  geom_line(linewidth = 1) +
  geom_text(aes(label = round(Value, 1)),
            size = 2.5,
            vjust = -0.5,
            check_overlap = TRUE) +
  labs(title = "Monthly Low Water Quality Scores by Location for Coliforms",
       x = "Month", y = "Value") +
  facet_wrap(~Location, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

# Plot low-score trends for other variables excluding coliforms
ggplot(low_scores %>% filter(!Variable %in% c("E.Coli (MPN/100ml)", "Tot.Coliforms (MPN/100ml)")),
       aes(x = Date, y = Value, color = Variable)) +
  geom_line(linewidth = 1) +
  geom_text(aes(label = round(Value, 1)),
            size = 2.5,
            vjust = -0.5,
            check_overlap = TRUE) +
  labs(title = "Monthly Low Water Quality Scores by Location for pH, P.Tot (ppb), Turbidity",
       x = "Month", y = "Value") +
  facet_wrap(~Location, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

# Filter specific low score subsets for inspection
low_ph_scores <- low_scores %>% filter(Variable == "pH.25DegC (pH)")
low_Coli_HV3 <- low_scores %>% filter(Variable %in% c("E.Coli (MPN/100ml)", "Tot.Coliforms (MPN/100ml)") & Location =="HV3")
print(low_ph_scores)
## # A tibble: 2 × 6
## Date Location Variable Value Score YearMonth
## <dttm> <chr> <chr> <dbl> <dbl> <dttm>
## 1 2019-04-11 09:40:00 AK1 pH.25DegC (pH) 9.4 2 2019-04-01 00:00:00
## 2 2023-01-24 11:23:00 GS3 pH.25DegC (pH) 9.8 1 2023-01-01 00:00:00
print(low_Coli_HV3)
## # A tibble: 2 × 6
## Date Location Variable Value Score YearMonth
## <dttm> <chr> <chr> <dbl> <dbl> <dttm>
## 1 2022-06-14 13:50:00 HV3 Tot.Coliforms (M… 3873 2 2022-06-01 00:00:00
## 2 2023-11-21 13:55:00 HV3 E.Coli (MPN/100m… 1300 2 2023-11-01 00:00:00
print(low_scores)
## # A tibble: 196 × 6
## Date Location Variable Value Score YearMonth
## <dttm> <chr> <chr> <dbl> <dbl> <dttm>
## 1 2019-04-11 09:40:00 AK1 pH.25DegC (pH) 9.4 e0 2 2019-04-01 00:00:00
## 2 2019-09-04 09:25:00 AK1 Turbidity..For… 1.23e2 1 2019-09-01 00:00:00
## 3 2019-09-04 09:25:00 AK1 P..Tot. (ppb) 2.62e2 2 2019-09-01 00:00:00
## 4 2019-09-04 09:25:00 AK1 Tot.Coliforms … 2.42e4 1 2019-09-01 00:00:00
## 5 2019-09-04 09:25:00 AK1 E.Coli (MPN/10… 1.99e4 1 2019-09-01 00:00:00
## 6 2019-10-03 08:45:00 AK1 E.Coli (MPN/10… 1.73e3 2 2019-10-01 00:00:00
## 7 2020-03-04 09:20:00 AK1 E.Coli (MPN/10… 1.41e3 2 2020-03-01 00:00:00
## 8 2020-07-07 09:35:00 AK1 E.Coli (MPN/10… 1.05e3 2 2020-07-01 00:00:00
## 9 2021-12-14 09:00:00 AK1 Tot.Coliforms … 2.42e4 1 2021-12-01 00:00:00
## 10 2021-12-14 09:00:00 AK1 E.Coli (MPN/10… 3.45e3 1 2021-12-01 00:00:00
## # ℹ 186 more rows
4. Analyse water quality - Time-series data
4.1 Analyse water quality - Stationarity test
The main purpose of this code is to test the stationarity of time
series data corresponding to each monitoring point (Location) and water
quality indicator (Variable). The specific process includes:
- aggregating the data by month and taking the average;
- using linear interpolation to fill in missing values and
constructing a time series object;
- using the ADF (Augmented Dickey-Fuller) test to determine whether
the time series is stationary, and outputting the p-value and the
determination result for each set of data.
# Split data_long by Location and Variable into a list
data_list <- data_long %>%
  group_by(Location, Variable) %>%
  group_split()
# Create full YearMonth sequence to ensure continuous monthly data
full_dates <- data.frame(YearMonth = seq.Date(from = as.Date("2019-01-01"),
                                              to = as.Date("2023-12-01"),
                                              by = "month"))
# For each group, aggregate monthly means, fill missing months by interpolation, then run ADF test
results <- map(data_list, function(df) {
  df <- df %>%
    arrange(Date) %>%
    mutate(YearMonth = floor_date(as.Date(Date), unit = "month")) %>%
    group_by(YearMonth) %>%
    summarise(Value = mean(as.numeric(Value), na.rm = TRUE), .groups = "drop")
  # Left join to get full monthly timeline, with NAs for missing months
  df_full <- full_dates %>%
    left_join(df, by = "YearMonth") %>%
    arrange(YearMonth)
  # Linear interpolation for missing values
  df_full$Value <- zoo::na.approx(df_full$Value, na.rm = FALSE)
  values <- as.numeric(df_full$Value)
  # Convert to time series object
  ts_obj <- ts(values, start = c(2019, 1), frequency = 12)
  res <- suppressWarnings(adf.test(ts_obj))
  return(res)
})
# Summarize test results into a table with p-values and stationarity flag
summary_table <- map_dfr(seq_along(results), function(i) {
test_result <- results[[i]]
tibble(
ID = i,
p_value = test_result$p.value,
stationary = ifelse(test_result$p.value < 0.05, TRUE, FALSE)
)
})
print(summary_table)
## # A tibble: 80 × 3
## ID p_value stationary
## <int> <dbl> <lgl>
## 1 1 0.0306 TRUE
## 2 2 0.0442 TRUE
## 3 3 0.01 TRUE
## 4 4 0.479 FALSE
## 5 5 0.01 TRUE
## 6 6 0.560 FALSE
## 7 7 0.01 TRUE
## 8 8 0.01 TRUE
## 9 9 0.0301 TRUE
## 10 10 0.0286 TRUE
## # ℹ 70 more rows
This code filters out the time series that passed the ADF stationarity
test. Based on the test results table generated above, it extracts the
indices of all sequences determined to be stationary (stationary ==
TRUE) and collects those sequences from the original data list into
stationary_data. Forced differencing was deliberately avoided because
it can remove genuine trend components from the data. The stationary
series are then carried forward to the white noise test.
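The collection loop below is equivalent to direct list subsetting in
base R; a toy illustration (the element names a, b, c are illustrative):

```r
data_list_demo <- list(a = "series1", b = "series2", c = "series3")
ids <- c(1, 3)                 # indices flagged as stationary
picked <- data_list_demo[ids]  # subsetting a list by an index vector
names(picked)  # "a" "c"
```

In other words, stationary_data <- data_list[stationary_ids] would
produce the same list in one line.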
# Extract IDs of stationary time series from summary table
stationary_ids <- summary_table %>%
  filter(stationary == TRUE) %>%
  pull(ID)
# Collect corresponding data frames from data_list
stationary_data <- list()
for (j in seq_along(stationary_ids)) {
  i <- stationary_ids[j]  # position of the stationary sequence in data_list
  stationary_data[[j]] <- data_list[[i]]
}
# Combine all stationary time series into one data frame
stationary_df <- bind_rows(stationary_data)
head(stationary_df)
## # A tibble: 6 × 4
## Date Location Variable Value
## <dttm> <chr> <chr> <dbl>
## 1 2019-01-08 09:20:00 AK1 Dis.Oxygen.Sat (%) 88.8
## 2 2019-02-04 09:05:00 AK1 Dis.Oxygen.Sat (%) 90.8
## 3 2019-03-05 09:15:00 AK1 Dis.Oxygen.Sat (%) 88.4
## 4 2019-04-11 09:40:00 AK1 Dis.Oxygen.Sat (%) 87.4
## 5 2019-05-07 09:20:00 AK1 Dis.Oxygen.Sat (%) 91.8
## 6 2019-06-05 09:50:00 AK1 Dis.Oxygen.Sat (%) 94.3
4.2 Analyse water quality - white noise test
The primary function of this code is to further analyze whether the
previously screened stationary time series are white noise. The
specific process includes:
- aggregating monthly data for each stationary monitoring point and
indicator;
- using the Ljung-Box test (Box.test) to test whether the
time series is white noise, where a p-value below 0.05 rejects the
white noise hypothesis;
- recording the test results (p-value, statistic) for each group,
along with the corresponding monitoring point and variable
information;
- finally, summarizing the results into a data frame and screening out
the sequences that are significantly not white noise, listing the
corresponding locations and indicators.
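The direction of the Ljung-Box decision is worth a quick sanity check:
for pure noise the p-value is typically large, while an autocorrelated
seasonal signal yields a very small p-value. A self-contained sketch on
simulated series (stats::Box.test only, toy data):

```r
set.seed(42)
noise <- rnorm(60)                                    # white noise
seasonal <- sin(2 * pi * (1:60) / 12) + rnorm(60, sd = 0.2)  # seasonal signal
Box.test(noise, lag = 12, type = "Ljung-Box")$p.value     # typically large
Box.test(seasonal, lag = 12, type = "Ljung-Box")$p.value  # effectively zero
```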
# Apply white noise test (Ljung-Box) to each stationary time series
white_noise_results <- map(stationary_data, function(df) {
  df <- df %>% arrange(Date)
  Locations <- df$Location
  Var <- df$Variable
  # Aggregate to monthly mean values
  df <- df %>%
    mutate(YearMonth = floor_date(as.Date(Date), unit = "month")) %>%
    group_by(YearMonth) %>%
    summarise(Value = mean(as.numeric(Value), na.rm = TRUE), .groups = "drop")
  # Fill missing months and interpolate missing values
  df_full <- full_dates %>%
    left_join(df, by = "YearMonth") %>%
    arrange(YearMonth)
  df_full$Value <- zoo::na.approx(df_full$Value, na.rm = FALSE)
  # Create time series object and run Ljung-Box test
  ts_obj <- ts(df_full$Value, start = c(2019, 1), frequency = 12)
  test_res <- Box.test(ts_obj, lag = 12, type = "Ljung-Box")
  # Return test result with metadata
  # (p < 0.05 rejects the white noise hypothesis, i.e. the series is
  # significantly non-white noise and therefore worth modeling)
  list(
    Location = unique(Locations),
    Variable = unique(Var),
    value = df_full$Value,
    date = df_full$YearMonth,
    p_value = test_res$p.value,
    statistic = test_res$statistic,
    white_noise = test_res$p.value < 0.05
  )
})
# Convert results to data frame for summary
white_noise_summary <- bind_rows(white_noise_results)
print(white_noise_summary)
## # A tibble: 3,720 × 7
## Location Variable value date p_value statistic white_noise
## <chr> <chr> <dbl> <date> <dbl> <dbl> <lgl>
## 1 AK1 Dis.Oxygen.Sat (%) 88.8 2019-01-01 0.00758 27.1 TRUE
## 2 AK1 Dis.Oxygen.Sat (%) 90.8 2019-02-01 0.00758 27.1 TRUE
## 3 AK1 Dis.Oxygen.Sat (%) 88.4 2019-03-01 0.00758 27.1 TRUE
## 4 AK1 Dis.Oxygen.Sat (%) 87.4 2019-04-01 0.00758 27.1 TRUE
## 5 AK1 Dis.Oxygen.Sat (%) 91.8 2019-05-01 0.00758 27.1 TRUE
## 6 AK1 Dis.Oxygen.Sat (%) 94.3 2019-06-01 0.00758 27.1 TRUE
## 7 AK1 Dis.Oxygen.Sat (%) 91.7 2019-07-01 0.00758 27.1 TRUE
## 8 AK1 Dis.Oxygen.Sat (%) 88.2 2019-08-01 0.00758 27.1 TRUE
## 9 AK1 Dis.Oxygen.Sat (%) 84.7 2019-09-01 0.00758 27.1 TRUE
## 10 AK1 Dis.Oxygen.Sat (%) 92.7 2019-10-01 0.00758 27.1 TRUE
## # ℹ 3,710 more rows
# List variables flagged as significantly non-white noise, with their locations
white_noise_vdata <- white_noise_summary %>%
  filter(white_noise) %>%
  select(Location, Variable) %>%
  group_by(Variable) %>%
  summarise(locations = paste(unique(Location), collapse = ", "))
print(white_noise_vdata)
## # A tibble: 8 × 2
## Variable locations
## <chr> <chr>
## 1 Dis.Oxygen.Sat (%) AK1, HV3, RO6
## 2 E.Coli (MPN/100ml) RO6
## 3 NO3.NO2.as.N (ppb) AK1, GS3, HM2, WA1, WA2, WA5, WA7, WN5
## 4 P..Tot. (ppb) RO6, WA7
## 5 Tot.Coliforms (MPN/100ml) HV3, WA2, WN5
## 6 Turbidity..Form.Neph. (FNU) GS3
## 7 Water.Temp (°C) AK1, GS3, HM2, HV3, RO6, WA1, WA2, WA5, WA7, WN5
## 8 pH.25DegC (pH) AK1, HV3, RO6
4.3 Analyse water quality - ARIMA model
Based on the above verification, it was found that water temperature
and NO3.NO2 indicators are stable and non-white noise sequences in most
regions, so an appropriate model can be selected for model training and
prediction.
We selected the water temperature time series from the HV3 location
for modeling and prediction, with similar analysis processes applied to
other time series.
- Extracted the water temperature data from this location and
constructed a monthly time series.
- Used the auto.arima() function to automatically select the optimal
ARIMA model for fitting, plotted the fitting results, and generated a
forecast for the next 12 months’ water temperatures.
- Output the prediction results.
The model results show that the ARIMA model reasonably captures the
seasonal and non-seasonal changes in water temperature data. The
parameter estimates are stable and significant, and the error indicators
(such as RMSE, MAE, and MAPE) indicate that the model fits well. The
residuals show no obvious autocorrelation, indicating that the model has
good and reliable predictive performance.
The plotted fitting results are consistent with the original data
trend.
# Extract data for Water Temperature at location HV3
model_data <- white_noise_summary %>% select(date, Location, Variable, value)
watertmp_HV3_data <- model_data %>% filter(Variable == "Water.Temp (°C)" & Location == "HV3")
# Convert to time series object (monthly frequency)
watertmp_HV3_ts <- ts(watertmp_HV3_data$value, start = c(2019, 1), frequency = 12)
# Fit ARIMA model
model <- auto.arima(watertmp_HV3_ts)
# Show model summary
summary(model)
## Series: watertmp_HV3_ts
## ARIMA(0,0,1)(1,1,1)[12]
##
## Coefficients:
## ma1 sar1 sma1
## 0.3774 -0.2601 -0.5450
## s.e. 0.1392 0.2892 0.3869
##
## sigma^2 = 4.012: log likelihood = -104.08
## AIC=216.17 AICc=217.1 BIC=223.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1212822 1.734694 1.294984 -1.143345 8.597814 0.6139185
## ACF1
## Training set 0.0186704
# Plot observed vs fitted values
autoplot(watertmp_HV3_ts) +
  autolayer(fitted(model), series = "Fitted") +
  ggtitle("Observed vs Fitted values") +
  ylab("Water Temperature") +
  xlab("Time")

# Forecast next 12 months
future <- forecast(model, h = 12)
# Plot forecast
autoplot(future) +
  ggtitle("Forecast of Water Temperature") +
  xlab("Time") + ylab("Temperature")

print(future)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2024 22.81264 20.241104 25.38418 18.879815 26.74547
## Feb 2024 21.53825 18.790737 24.28577 17.336289 25.74022
## Mar 2024 20.18010 17.432580 22.92761 15.978132 24.38206
## Apr 2024 17.42398 14.676467 20.17150 13.222020 21.62595
## May 2024 13.78116 11.033646 16.52868 9.579199 17.98312
## Jun 2024 11.37134 8.623822 14.11885 7.169375 15.57330
## Jul 2024 10.19565 7.448136 12.94317 5.993689 14.39761
## Aug 2024 10.70544 7.957924 13.45296 6.503477 14.90740
## Sep 2024 13.30957 10.562057 16.05709 9.107610 17.51154
## Oct 2024 16.14781 13.400295 18.89533 11.945848 20.34977
## Nov 2024 18.07606 15.328541 20.82357 13.874094 22.27802
## Dec 2024 17.92570 15.178513 20.67289 13.724238 22.12717
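The residual-autocorrelation claim can be verified with a Ljung-Box test
on the fitted model's residuals. Since the HV3 data live in the chunks
above, here is a self-contained sketch of the same diagnostic using base
stats::arima on the built-in nottem monthly temperature series (the
ARIMA order is chosen to mirror the fit above for illustration, not the
auto.arima selection itself):

```r
# Seasonal ARIMA fit on a built-in monthly series, then a Ljung-Box
# check on the residuals; fitdf = 3 accounts for the three estimated
# coefficients (ma1, sar1, sma1)
fit <- arima(nottem, order = c(0, 0, 1),
             seasonal = list(order = c(1, 1, 1), period = 12))
lb <- Box.test(residuals(fit), lag = 24, fitdf = 3, type = "Ljung-Box")
lb$p.value  # a large p-value means no evidence of leftover autocorrelation
```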
5. Spatial Water Analysis for each year - nMDS and K-means
clustering
5.1 Spatial Water Analysis for each year - nMDS
The primary function of this code is to perform non-metric
multidimensional scaling (nMDS) on water quality indicators for each
region from 2019 to 2023, in order to visualize the similarities or
differences in overall water quality characteristics among regions over
the five-year period.
The code first calculates the annual average values of each water
quality indicator for each region by year, constructs a Bray-Curtis
distance matrix, and performs a two-dimensional nMDS dimensionality
reduction analysis. It then outputs the corresponding coordinate results
for each year and uses ggplot2 to plot the distribution patterns of each
region in the nMDS space for each year, making it easier to identify
trends in water quality characteristics over time and clustering
structures between regions.
- In 2019, the results of the nMDS analysis show that the similarity
of water quality across locations was relatively concentrated,
indicating that water quality indicators were consistent across
regions.
- In 2020, samples exhibited significant dispersion along the second
dimension of the nMDS, reflecting the emergence of regional differences
in water quality characteristics.
- In 2021, the 10 monitoring sites had clustered into three distinct
categories, with high intra-category similarity, suggesting that water
quality conditions were relatively consistent within each category.
- In 2022, the pattern of water quality similarity underwent
significant changes, with the HV3 and WA1 sites beginning to show
noticeable dispersion, resulting in five distinct categories, indicating
an increase in water quality differences.
- In 2023, similarity between samples further decreased, primarily
manifested in the dispersion along the first dimension of the nMDS,
indicating a significant enhancement in the spatial diversity and
heterogeneity of water quality conditions.
data_long_nmds <- data_long %>%
mutate(Year = format(as.Date(Date), "%Y"))
years <- 2019:2023
nmds_results <- list()
for (yr in years) {
# Filter the data of the current year
year_data <- data_long_nmds %>% filter(Year == as.character(yr))
# Construct a wide table of "region × indicator" : Calculate the average value for each indicator in each region
mat <- year_data %>%
group_by(Location, Variable) %>%
summarise(mean_value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = Variable, values_from = mean_value)
# Remove the "Location" column to obtain a pure numerical matrix
dist_mat <- mat %>% select(-Location) %>%
as.data.frame() %>%
mutate(across(everything(), as.numeric))
# Set the row names to the region names
rownames(dist_mat) <- mat$Location
# Calculate the distance matrix (Bray-Curtis)
bray_dist <- vegdist(dist_mat, method = "bray")
# Run nMDS (k = 2 dimensions)
suppressWarnings({
invisible(capture.output({
nmds <- metaMDS(bray_dist, k = 2, trymax = 50)
}))
})
# Extract nMDS coordinates and convert them to a data frame.
nmds_points <- as.data.frame(nmds$points)
nmds_points$Location <- rownames(nmds_points)
nmds_points$Year <- yr
nmds_results[[as.character(yr)]] <- nmds_points
}
nmds_all <- bind_rows(nmds_results)
head(nmds_all)
## MDS1 MDS2 Location Year
## AK1...1 -0.6084281 0.3273470 AK1 2019
## GS3...2 -0.2398702 -0.3031484 GS3 2019
## HM2...3 -0.3936827 -0.1875671 HM2 2019
## HV3...4 0.6383452 -0.1927208 HV3 2019
## RO6...5 0.8792933 0.4506501 RO6 2019
## WA1...6 -0.6084671 0.3263781 WA1 2019
# Draw a graph: x and y are nMDS coordinates.
ggplot(nmds_all, aes(x = MDS1, y = MDS2, color = Location)) +
geom_point(size = 3) +
facet_wrap(~ Year) +
theme_bw() +
labs(title = "nMDS analysis of water quality indicators in different regions of each year", x = "nMDS1", y = "nMDS2")
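Because metaMDS() is run inside suppressWarnings()/capture.output(), the stress value of each ordination is never reported. A minimal sketch of how stress could be recorded and checked; the helper function and the toy matrix are hypothetical stand-ins, not part of the original analysis:

```r
library(vegan)

# Hypothetical helper (not in the original analysis): fit a 2-D nMDS and
# return both the site coordinates and the stress value, so that
# goodness-of-fit can be reported alongside each year's plot.
fit_nmds <- function(dist_obj) {
  fit <- metaMDS(dist_obj, k = 2, trymax = 50, trace = FALSE)
  list(points = as.data.frame(fit$points), stress = fit$stress)
}

# Toy example standing in for one year's region-by-indicator matrix
set.seed(1)
toy <- matrix(runif(10 * 8), nrow = 10)
res <- fit_nmds(vegdist(toy, method = "bray"))
res$stress  # rule of thumb: < 0.1 is a good 2-D fit; > 0.2 should be interpreted with caution
```

Reporting stress per year would make it easier to judge how faithfully each facet of the plot above represents the underlying Bray-Curtis distances.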

5.2 Spatial Water Analysis for each year - K-means clustering
5.2.1 Find the optimal number of clusters k
The code segment first groups the data_long data by year and
constructs a wide-format matrix of region × indicator for each year to
analyze the characteristic differences between different regions in
multiple water quality indicators. Subsequently, these numerical data
are standardized, and the Silhouette method is used to evaluate the
clustering effectiveness under different numbers of clusters. The
optimal number of clusters k for each year is visualized using
fviz_nbclust, providing a basis for subsequent K-means clustering.
Based on the evaluation results of the Silhouette method, we selected
the highest-scoring number of clusters each year as the initial optimal
k value. The optimal k values for each year are as follows: 2 for 2019,
3 for 2020, 2 for 2021, 4 for 2022, and 2 for 2023.
However, it is important to note that in 2019 and 2020, the
difference between the highest Silhouette score and the second-highest
score (when k=3 or k=2) is relatively small, indicating that there are
multiple similar clustering structures in these two years. Therefore, in
the subsequent K-means clustering analysis, we will simultaneously test
both k values to ensure the robustness and interpretability of the
clustering results.
# Add a 'Year' column to the long-format data
data_long_cluster <- data_long %>%
mutate(Year = format(as.Date(Date), "%Y"))
years <- 2019:2023
cluster_results <- list()
for (yr in years) {
# Filter data for the specific year
year_data <- data_long_cluster %>% filter(Year == as.character(yr))
# Create wide-format table: average value for each Variable per Location
mat <- year_data %>%
group_by(Location, Variable) %>%
summarise(mean_value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = Variable, values_from = mean_value)
# Remove Location column and convert to numeric matrix
dist_mat <- mat %>% select(-Location) %>%
as.data.frame() %>%
mutate(across(everything(), as.numeric))
# Set row names as Location
rownames(dist_mat) <- mat$Location
# Standardize the data
scaled_data <- scale(dist_mat)
# Determine optimal number of clusters using silhouette method
print(fviz_nbclust(scaled_data, kmeans, method = "silhouette", k.max = 9) + # choose k by inspecting the silhouette plot
labs(title = paste("Silhouette Method for Optimal k -", yr)))
# Store the scaled data
cluster_results[[as.character(yr)]] <- scaled_data
}
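The optimal k values identified above feed into the final clustering step. A hedged sketch of how the per-year K-means partitions could be computed and visualized, assuming the scaled matrices stored in cluster_results by the loop above; the best_k vector uses the silhouette-based choices stated earlier (alternative k values for 2019 and 2020 would be substituted the same way):

```r
library(factoextra)  # fviz_cluster()

# Silhouette-based k for each year (assumption: taken from the text above)
best_k <- c("2019" = 2, "2020" = 3, "2021" = 2, "2022" = 4, "2023" = 2)

set.seed(123)  # K-means depends on random initial centers
for (yr in names(cluster_results)) {
  km <- kmeans(cluster_results[[yr]], centers = best_k[[yr]], nstart = 25)
  print(fviz_cluster(km, data = cluster_results[[yr]],
                     main = paste("K-means clustering of sites -", yr)))
}
```

Using nstart = 25 reruns K-means from multiple random starts and keeps the best solution, which guards against poor local optima on small matrices like these.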

6. Conclusion
This study systematically analyzed water quality data from 10
monitoring sites in the North Island of New Zealand between 2019 and
2023. By combining data cleaning, anomaly detection, time series
modeling, nMDS dimensionality reduction, and K-means clustering, the
study revealed both temporal changes and spatial differences in water
quality across regions.
First, through visualization and analysis of monthly average water
quality scores, several key abnormal indicators were identified. For
example, E. coli and total coliform concentrations were abnormally high
at the WA1 site; turbidity and total phosphorus were high at the WA5
site; and dissolved oxygen was unusually low at the AK1 site. These
findings suggest potential pollution problems in these regions. In
addition, NO₃+NO₂ and water temperature showed clear seasonal variation,
which is useful for future predictive modeling.
Additionally, based on further analysis of low scores (1 or 2), it
was found that RO6 and WN5 had overall good water quality, with no
indicators scoring below grade B. HV3 showed good performance for
coliform indicators, and GS3 had good results for phosphorus and
turbidity. In contrast, AK1, WA1, and WA5 showed multiple indicators
with poor results and frequent pollution, suggesting the need to
investigate local land use types to understand the causes.
The nMDS results revealed changing patterns of water quality
similarity across years. In 2019, sites were closely grouped, indicating
high similarity. In 2020, samples became more spread out along the
second dimension, suggesting regional differences. In 2021, three
clusters formed with high similarity within groups. In 2022, sites like
HV3 and WA1 became more distinct, forming five clusters. By 2023,
differences increased further, mainly along the first dimension, showing
more spatial diversity in water quality.
In the K-means clustering analysis, we selected k = 4, 3, 2, 4, and 2
as the number of clusters for 2019 through 2023, respectively. The results showed a clear
classification of the 10 sites based on their water quality
characteristics. Although clustering patterns changed over time, some
sites remained in the same cluster over the years. For example, AK1 and
HM2, HV3 and WA7, and RO6 and WN5 were often grouped together, showing
stable and similar water quality. WA1 often formed its own cluster,
especially in 2021–2022, indicating a unique water quality pattern. In
contrast, WA5 changed cluster frequently, reflecting large changes in
environmental conditions. GS3 and WA2 were in the same cluster in all
years except 2020, suggesting similar water quality in most years.
In addition, this study conducted stationarity tests and white noise
tests on the time series of each variable in each region, then selected
one stationary, non-white-noise series for ARIMA model training and
forecasting, achieving a good model fit.
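The screening described above can be sketched as follows. This is a hypothetical illustration, not the original code: the series ts_data is simulated, and the adf.test() function comes from the tseries package (Box.test() is in base R):

```r
library(tseries)  # adf.test()

# Hypothetical sketch of the screening described above: a series is kept
# for ARIMA modeling if it is stationary (ADF test rejects a unit root)
# and NOT white noise (Ljung-Box test rejects independence).
set.seed(42)
ts_data <- ts(arima.sim(list(ar = 0.6), n = 60), frequency = 12)  # simulated monthly series

adf_p <- adf.test(ts_data)$p.value                               # < 0.05 -> stationary
lb_p  <- Box.test(ts_data, lag = 12, type = "Ljung-Box")$p.value # < 0.05 -> not white noise

if (adf_p < 0.05 && lb_p < 0.05) {
  fit <- arima(ts_data, order = c(1, 0, 0))  # candidate ARIMA fit
}
```

Series that fail either test are either differenced (non-stationary) or excluded (white noise, hence nothing for an ARIMA model to learn).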